The ZCache: Decoupling Ways and Associativity
Abstract—The ever-increasing importance of main memory latency and bandwidth is pushing CMPs towards caches with higher capacity and associativity. Associativity is typically improved by increasing the number of ways. This reduces conflict misses, but increases hit latency and energy, imposing a stringent trade-off on cache design. We present the zcache, a cache design that allows much higher associativity than the number of physical ways (e.g. a 64-associative cache with 4 ways). The zcache draws on previous research on skew-associative caches and cuckoo hashing. Hits, the common case, require a single lookup, incurring the latency and energy costs of a cache with a very low number of ways. On a miss, additional tag lookups happen off the critical path, yielding an arbitrarily large number of replacement candidates for the incoming block. Unlike conventional designs, the zcache provides associativity by increasing the number of replacement candidates, not the number of cache ways. To understand the implications of this approach, we develop a general analysis framework that allows us to compare associativity across different cache designs (e.g. a set-associative cache and a zcache) by representing associativity as a probability distribution. We use this framework to show that for zcaches, associativity depends only on the number of replacement candidates, and is independent of other factors (such as the number of cache ways or the workload). We also show that, for the same number of replacement candidates, the associativity of a zcache is superior to that of a set-associative cache for most workloads. Finally, we perform detailed simulations of multithreaded and multiprogrammed workloads on a large-scale CMP with a zcache as the last-level cache. We show that zcaches provide higher performance and better energy efficiency than conventional caches without incurring the overheads of designs with a large number of ways.
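As a rough illustration of the miss-path expansion the abstract describes, the Python sketch below models a cache with one hash function per way: a hit is a single parallel probe, while a miss walks relocation options (as in cuckoo hashing) to collect many replacement candidates from only a few ways. All names here (`ZCacheSketch`, the `blake2b`-based indexing) are illustrative assumptions, a software stand-in for the paper's hardware hashing.

```python
import hashlib

class ZCacheSketch:
    """Toy zcache: W ways, each indexed by its own hash function.
    Associativity comes from the number of replacement candidates
    explored on a miss, not from the number of ways."""

    def __init__(self, ways=4, sets=256, levels=2):
        self.ways, self.sets, self.levels = ways, sets, levels
        # One tag array per way; None marks an invalid entry.
        self.array = [[None] * sets for _ in range(ways)]

    def _index(self, way, addr):
        # Hypothetical per-way hash; hardware would use cheap hash circuits.
        h = hashlib.blake2b(f"{way}:{addr}".encode(), digest_size=4)
        return int.from_bytes(h.digest(), "big") % self.sets

    def lookup(self, addr):
        # Hit path: a single parallel probe, one location per way.
        return any(self.array[w][self._index(w, addr)] == addr
                   for w in range(self.ways))

    def candidates(self, addr):
        # Miss path, off the critical path: each resident block at a
        # probed location could itself move to its positions in the
        # other ways, expanding the set of replacement candidates.
        frontier = [(w, self._index(w, addr)) for w in range(self.ways)]
        seen, out = set(frontier), list(frontier)
        for _ in range(self.levels):
            nxt = []
            for (w, s) in frontier:
                blk = self.array[w][s]
                if blk is None:
                    continue
                for w2 in range(self.ways):
                    if w2 == w:
                        continue
                    pos = (w2, self._index(w2, blk))
                    if pos not in seen:
                        seen.add(pos)
                        nxt.append(pos)
                        out.append(pos)
            frontier = nxt
        return out
```

With 4 ways and two expansion levels, a full cache yields up to 4 + 12 + 36 = 52 candidates from a 4-way structure, which is the decoupling the title refers to.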
Effects of component-subscription network topology on large-scale data centre performance scaling
Modern large-scale data centres, such as those used for cloud computing
service provision, are becoming ever-larger as the operators of those data
centres seek to maximise the benefits from economies of scale. With these
increases in size comes a growth in system complexity, which is usually
problematic. There is an increased desire for automated "self-star"
configuration, management, and failure-recovery of the data-centre
infrastructure, but many traditional techniques scale much worse than linearly
as the number of nodes to be managed increases. As the number of nodes in a
median-sized data-centre looks set to increase by two or three orders of
magnitude in coming decades, it seems reasonable to attempt to explore and
understand the scaling properties of the data-centre middleware before such
data-centres are constructed. In [1] we presented SPECI, a simulator that
predicts aspects of large-scale data-centre middleware performance,
concentrating on the influence of status changes such as policy updates or
routine node failures. [...]. In [1] we used a first-approximation assumption
that such subscriptions are distributed wholly at random across the data
centre. In this present paper, we explore the effects of introducing more
realistic constraints to the structure of the internal network of
subscriptions. We contrast the original results [...] exploring the effects of
making the data-centre's subscription network have a regular lattice-like
structure, and also semi-random network structures resulting from parameterised
network generation functions that create "small-world" and "scale-free"
networks. We show that for distributed middleware topologies, the structure and
distribution of tasks carried out in the data centre can significantly
influence the performance overhead imposed by the middleware.
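The "parameterised network generation functions" the abstract contrasts with the original all-random subscription assumption can be sketched in a few lines of dependency-free Python. The Watts-Strogatz-style generator below spans the spectrum the paper explores: p = 0 gives the regular ring lattice, p = 1 approaches the fully random subscription graph of the earlier SPECI work. Function names and parameters are illustrative, not SPECI's API.

```python
import random

def ring_lattice(n, k):
    """Regular lattice: each of n nodes subscribes to its k nearest
    neighbours on a ring (k even)."""
    return {i: {(i + d) % n for d in range(-k // 2, k // 2 + 1) if d != 0}
            for i in range(n)}

def small_world(n, k, p, seed=0):
    """Watts-Strogatz-style rewiring: start from the regular lattice and
    rewire each edge to a uniformly random endpoint with probability p.
    Intermediate p values produce 'small-world' subscription networks."""
    rng = random.Random(seed)
    adj = ring_lattice(n, k)
    for i in range(n):
        for j in sorted(adj[i]):       # snapshot of i's neighbours
            if j > i and rng.random() < p:
                m = rng.randrange(n)   # pick a fresh endpoint
                while m == i or m in adj[i]:
                    m = rng.randrange(n)
                adj[i].discard(j); adj[j].discard(i)   # drop (i, j)
                adj[i].add(m); adj[m].add(i)           # add (i, m)
    return adj
```

Rewiring preserves the total number of subscriptions, so any performance difference in a simulation comes from topology alone, which is the comparison the paper makes.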
INFaaS: A Model-less and Managed Inference Serving System
Despite existing work in machine learning inference serving, ease-of-use and
cost efficiency remain challenges at large scales. Developers must manually
search through thousands of model-variants -- versions of already-trained
models that differ in hardware, resource footprints, latencies, costs, and
accuracies -- to meet the diverse application requirements. Since requirements,
query load, and applications themselves evolve over time, these decisions need
to be made dynamically for each inference query to avoid excessive costs
through naive autoscaling. To avoid navigating through the large and complex
trade-off space of model-variants, developers often fix a variant across
queries, and replicate it when load increases. However, given the diversity
across variants and hardware platforms in the cloud, a lack of understanding of
the trade-off space can incur significant costs to developers.
This paper introduces INFaaS, a managed and model-less system for distributed
inference serving, where developers simply specify the performance and accuracy
requirements for their applications without needing to choose a specific
model-variant for each query. INFaaS generates model-variants and efficiently
navigates the large trade-off space of model-variants on behalf of developers
to meet application-specific objectives: (a) for each query, it selects a
model, hardware architecture, and model optimizations, (b) it combines VM-level
horizontal autoscaling with model-level autoscaling, where multiple, different
model-variants are used to serve queries within each machine. By leveraging
diverse variants and sharing hardware resources across models, INFaaS achieves
1.3x higher throughput, violates latency objectives 1.6x less often, and saves
up to 21.6x in cost (8.5x on average) compared to state-of-the-art inference
serving systems on AWS EC2.
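The core of the per-query selection step (a) can be illustrated with a minimal sketch: given only the application's latency and accuracy requirements, pick the cheapest model-variant whose profiled characteristics satisfy both. The `Variant` fields and function names below are illustrative assumptions, not INFaaS's actual schema or policy, which also accounts for load and autoscaling state.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Variant:
    # Hypothetical profile of one model-variant: same trained model,
    # different hardware / optimizations, hence different trade-offs.
    name: str
    hardware: str
    latency_ms: float
    accuracy: float
    cost_per_query: float

def select_variant(variants, max_latency_ms, min_accuracy) -> Optional[Variant]:
    """Model-less selection: the developer states requirements, the
    system picks the cheapest variant meeting both, or None if the
    objectives cannot be met by any profiled variant."""
    feasible = [v for v in variants
                if v.latency_ms <= max_latency_ms
                and v.accuracy >= min_accuracy]
    return min(feasible, key=lambda v: v.cost_per_query, default=None)
```

Note how the answer changes with the SLO: a tight latency bound forces the expensive GPU variant, while a relaxed one lets the system fall back to a cheaper CPU variant, which is exactly the trade-off space developers would otherwise navigate by hand.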
Measuring Latency: Am I doing it right?
This poster describes the basic methodology for conducting an accurate latency experiment.
- …
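The poster's full methodology is not reproduced here, but the common-practice core of an accurate latency experiment can be sketched: warm up first, time each operation with a monotonic high-resolution clock, and report percentiles rather than the mean, which hides tail behaviour. The function and parameter names are illustrative.

```python
import time
import statistics

def measure_latency(op, warmup=100, samples=10_000):
    """Minimal latency experiment for a callable `op`.
    Warmup absorbs cold caches and JIT/allocator effects; the
    monotonic perf_counter_ns avoids wall-clock adjustments."""
    for _ in range(warmup):
        op()
    lat = []
    for _ in range(samples):
        t0 = time.perf_counter_ns()
        op()
        lat.append(time.perf_counter_ns() - t0)
    lat.sort()
    q = statistics.quantiles(lat, n=100)  # 99 percentile cut points
    return {"p50": q[49], "p99": q[98], "max": lat[-1]}
```

One known pitfall this sketch does not address is coordinated omission: timing back-to-back calls under-counts the delay a fixed-rate client would observe when one slow call stalls the next.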